As we learned in lecture, visualizing data can be tricky. This is especially true for data tied to spatial information, such as a map. Recall our example on election results: maps can be deceiving because it is easy to conflate the area a region takes up on the screen with the magnitude of the underlying data.
In this code-along, we will attach data to a map and explore visualizations that allow others to explore and interpret the data intuitively.
Goal: Visualize data using maps, and maybe even time.
We will need several packages to help us through this journey. Install any that you have not previously used.
library(tidyverse) # data exploration toolset
library(gganimate) # visualizing through time
library(lubridate) # exploring time data
library(sf) # makes mapping easy...er
library(cartogram) # transform spatial objects
library(rayshader) # shine a light on data
library(rgl) # render data viz interactively
library(viridis) # nice color palettes
library(DT) # interactive data tables
library(plotly) # interactive data plots
library(maps) # map tools
We will be visualizing total COVID19 case counts, using data curated by the New York Times in their open GitHub repository.
# load the data into our environment. Remember to check that readr guessed the classes
# of the columns correctly!
NYTimes_COVID19_data <- readr::read_csv(file = "data/us-states.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## date = col_date(format = ""),
## state = col_character(),
## fips = col_character(),
## cases = col_double(),
## deaths = col_double()
## )
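If any of the guesses had been wrong (for example, `fips` parsed as a double, which would drop the leading zero from codes like California's "06"), we could state the types explicitly instead of relying on guessing. A sketch based on the specification printed above; the name `my_col_types` is my own:

```r
# explicit column specification: fips stays character so "06" keeps its zero
my_col_types <- readr::cols(
  date   = readr::col_date(format = ""),
  state  = readr::col_character(),
  fips   = readr::col_character(),
  cases  = readr::col_double(),
  deaths = readr::col_double()
)
# NYTimes_COVID19_data <- readr::read_csv("data/us-states.csv",
#                                         col_types = my_col_types)
```

Passing `col_types` also silences the column-specification message on every knit.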
We will also want to normalize the case counts by state population. I have provided the data, along with the code showing how I obtained it.
# ran tidycensus::census_api_key() first with key sent to inbox.
# More info: https://walker-data.com/tidycensus/articles/basic-usage.html
## Ran this once and commented out so I do not have to keep bothering
## the API
# state_pop <-
# tidycensus::get_acs(geography = "state",
# variables = "B01003_001",
# year = 2018,
# geometry = FALSE)
# readr::write_rds(state_pop,file = "data/state_pop.rds")
state_pop <- readr::read_rds(file = "data/state_pop.rds")
# Glimpse at our data.
# Notice how every state is not represented for every date. We will fix that.
glimpse(NYTimes_COVID19_data)
## Rows: 23,004
## Columns: 5
## $ date <date> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-24, 20…
## $ state <chr> "Washington", "Washington", "Washington", "Illinois", "Washingt…
## $ fips <chr> "53", "53", "53", "17", "53", "06", "17", "53", "04", "06", "17…
## $ cases <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, …
## $ deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Plot a variable. Does this look sensible? Is this data cumulative, or new cases "per day"?
NYTimes_COVID19_data %>%
filter(state=="Massachusetts") %>%
ggplot() +
geom_point(aes(x=date,y=cases))
The data are cumulative. Let's use the trick we discussed in the first code-along to compute new cases per day.
NYTimes_COVID19_data <- NYTimes_COVID19_data %>%
group_by(state) %>%
# subtract cases from previous cases from this state
mutate(new_cases = cases - lag(cases, default = 0)) %>%
# take out any weird data brought on by changes in reporting standards, etc.
filter(new_cases >= 0) %>%
ungroup() %>%
# "complete" missing dates for unreported days, filling new_cases with NA.
# We will be averaging cases with a rolling mean, so cases missed on one day
# and reported the next will average out.
tidyr::complete(date = tidyr::full_seq(date,period=1),
state,
fill=list(new_cases = NA))
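To see exactly what the `lag()` trick does, here it is on a toy cumulative series (my own made-up numbers):

```r
# toy cumulative totals for one state
cumulative <- c(1, 1, 3, 6, 6, 10)

# new cases = today's total minus yesterday's; default = 0 means the first
# day's entire total counts as new cases that day
new_cases <- cumulative - dplyr::lag(cumulative, default = 0)
new_cases
# [1] 1 0 2 3 0 4
```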
# Does this look like the "new cases per day" plot we expect?
NYTimes_COVID19_data %>%
filter(state=="Massachusetts") %>%
ggplot() +
geom_line(aes(x=date,y=new_cases))
## Warning: Removed 11 row(s) containing missing values (geom_path).
Let’s add the rolling-mean variable, which averages out the daily variation throughout the week that we discovered previously.
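The rolling mean itself was not computed above, so here is a minimal sketch. The helper `roll7` and the column name `new_cases_7day` are my own choices; `stats::filter` computes a centered 7-day mean and leaves `NA` wherever the full window is unavailable (packages such as `zoo` or `slider` offer more flexible alternatives).

```r
# centered 7-day rolling mean; NA at the edges where the window is incomplete
roll7 <- function(x) as.numeric(stats::filter(x, rep(1/7, 7), sides = 2))

# how it would slot into our pipeline, one state at a time:
# NYTimes_COVID19_data <- NYTimes_COVID19_data %>%
#   group_by(state) %>%
#   mutate(new_cases_7day = roll7(new_cases)) %>%
#   ungroup()

roll7(1:7)
# [1] NA NA NA  4 NA NA NA
```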
What would this look like if we tried to visualize the relationship between cases and states over time?
NYTimes_COVID19_data %>%
ggplot() +
geom_line(aes(x=date,y=new_cases,color=state))
## Warning: Removed 2296 row(s) containing missing values (geom_path).
Hmmm…
What about for a certain day? Can we get a more readable graph?
my_date <- "2020-07-01"
NYTimes_COVID19_data %>%
filter(date == my_date) %>%
ggplot() +
geom_col(aes(y=new_cases,x=state, fill=new_cases)) +
theme_minimal() +
theme(axis.text.x = element_text(angle=90,hjust=1,vjust=0.5))
We can make more sense of the data this way. But, it is still difficult to quickly find data you may be interested in, and more importantly, “see” the data in space. This is an epidemic—we need to be able to see where the cases are. For that, we need to map the data onto… a map.
# we need to get the geospatial data
usa <- st_as_sf(maps::map("state", fill = TRUE, plot = FALSE)) %>%
sf::st_transform(crs = 4326) # EPSG:4326 is the WGS 84 longitude/latitude CRS
usa
## Simple feature collection with 49 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.6785 ymin: 25.13392 xmax: -66.99729 ymax: 49.38935
## Geodetic CRS: WGS 84
## First 10 features:
## ID geom
## 1 alabama MULTIPOLYGON (((-87.45513 3...
## 2 arizona MULTIPOLYGON (((-114.6336 3...
## 3 arkansas MULTIPOLYGON (((-94.04464 3...
## 4 california MULTIPOLYGON (((-120.0027 4...
## 5 colorado MULTIPOLYGON (((-102.0493 4...
## 6 connecticut MULTIPOLYGON (((-73.48976 4...
## 7 delaware MULTIPOLYGON (((-75.79354 3...
## 8 district of columbia MULTIPOLYGON (((-77.12873 3...
## 9 florida MULTIPOLYGON (((-85.00834 3...
## 10 georgia MULTIPOLYGON (((-80.8826 32...
Adding our case estimates to our geospatial data:
NYTimes_COVID19_data %>%
mutate(ID = tolower(state)) -> NYTimes_COVID19_data
covid_map_data <- left_join(x = usa, y = NYTimes_COVID19_data, by="ID")
# geom_sf finds the spatial data within covid_map_data automatically
ggplot(covid_map_data %>% filter(date == my_date)) +
geom_sf(aes(fill=new_cases))
# we can use a color palette that makes differences in magnitude
# easier to perceive
covid_map_data %>%
filter(date == my_date) %>%
ggplot() +
geom_sf(aes(fill=new_cases)) +
scale_fill_viridis(option = "plasma") +
labs(title=paste("New COVID19 cases on", my_date)) +
theme_bw()
# What if we wanted to visualize the number of new cases normalized by the per-state
# population?
covid_map_data <- covid_map_data %>%
left_join(state_pop, by=c("state"="NAME")) %>%
mutate(new_cases_per_1e5people = (new_cases*1e5)/estimate )
ggplot(covid_map_data %>% filter(date == my_date)) +
geom_sf(aes(fill=new_cases_per_1e5people)) +
scale_fill_viridis(option = "plasma")
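The per-100,000 normalization is simple arithmetic. As a quick standalone check of the formula with made-up numbers:

```r
# 700 new cases in a state of 7,000,000 people is 10 cases per 100,000
new_cases <- 700
estimate  <- 7e6  # the ACS population column is named "estimate"
(new_cases * 1e5) / estimate
# [1] 10
```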
# we can add points via their long/lat coordinates
EC_local <- tibble(long=-71.10307267670433,lat=42.340854411959974,name="Emmanuel College")
# we just have to convert them to the sf class first
EC_local <- EC_local %>%
sf::st_as_sf(coords = c("long", "lat"), crs = 4326)
# adding points
ggplot(covid_map_data %>% filter(date == my_date)) +
geom_sf(aes(fill=new_cases_per_1e5people)) +
geom_sf(data = EC_local, size=2,color="red", shape=23) +
scale_fill_viridis(option = "plasma")
# zooming in using coord_sf()
ggplot(covid_map_data %>% filter(date == my_date)) +
geom_sf(aes(fill=new_cases_per_1e5people)) +
geom_sf(data = EC_local, size=2,color="red", shape=23) +
scale_fill_viridis(option = "plasma") +
coord_sf(xlim = c(-80,-70),ylim = c(40,45), expand = F) +
theme_bw()
# we can convert the states to circles with area proportional to a variable
# of interest (a Dorling cartogram)
cart <- cartogram::cartogram_dorling(covid_map_data %>%
filter(date == my_date) %>%
sf::st_transform(crs = 5070), "new_cases_per_1e5people")
ggplot(cart) +
geom_sf(aes(fill=new_cases_per_1e5people)) +
scale_fill_viridis(option = "plasma") +
theme_minimal()
# we can scale each state's area so it is proportional to the variable of
# interest (a non-contiguous cartogram)
cart2 <- cartogram::cartogram_ncont(covid_map_data %>%
filter(date == my_date) %>%
sf::st_transform(crs = 5070), "new_cases_per_1e5people")
ggplot(cart2) +
geom_sf(aes(fill=new_cases_per_1e5people)) +
scale_fill_viridis(option = "plasma") +
theme_minimal()
Sometimes it is difficult to “read” the data even when it is laid out geospatially. For instance, in the map below, do you know the name of every state? Can you tell the exact value of new cases per 100,000 people in each state?
ggplot(covid_map_data %>% filter(date == my_date) %>% st_transform(crs = 5070)) +
geom_sf(aes(fill=new_cases_per_1e5people)) +
scale_fill_viridis(option = "plasma") +
theme_bw()
We can complement our visualizations with the actual data!
covid_map_data %>%
as_tibble() %>%
filter(date == my_date) %>%
select(date, state, new_cases,new_cases_per_1e5people) %>%
arrange(desc(new_cases_per_1e5people)) %>%
mutate(new_cases_per_1e5people = round(new_cases_per_1e5people,2)) %>%
DT::datatable()
Or maybe you want your data directly on the map!
my_map <- ggplot(covid_map_data %>%
filter(date == my_date) %>%
mutate(new_cases_per_1e5people = round(new_cases_per_1e5people,2)) %>%
st_transform(crs = 5070)) +
geom_sf(aes(fill=new_cases_per_1e5people, text=state)) +
scale_fill_viridis(option = "plasma") +
theme_bw() +
labs(title=paste("New COVID19 cases on", my_date,"\nI am interactive! Click around."))
## Warning: Ignoring unknown aesthetics: text
plotly::ggplotly(my_map) %>%
style(hoverlabel = list(bgcolor = "white"), hoveron = "fill")
# plug our ggplot into the plot_gg function from the awesome rayshader package
plot_gg(my_map, width = 5, height = 5, multicore = TRUE, scale = 250,
zoom = 0.6, theta = 0, phi = 90, windowsize = c(800, 800))
# wait a bit
Sys.sleep(0.2)
# render the output as a widget so it embeds within our html output.
rgl::rglwidget()